Task 2


Clustering

Kaggle

Dataset description

Context

Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

Content

Attributes

People

Products

Promotion

Place

Target

Import required libraries

import seaborn as sns

sns.set_context("paper", rc={"font.size":12, "figure.titlesize":18, "axes.titlesize":15, "axes.labelsize":13, "xtick.labelsize": 13, "ytick.labelsize": 13, "legend.fontsize": 9, "legend.title_fontsize": 11})

EDA

First look

There are two columns that are not mentioned in the dataset description: Z_CostContact and Z_Revenue

Since each of these columns contains only a single value, I can delete them
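A minimal sketch of how constant columns like these can be found and dropped. The toy values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
df = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_Revenue": [11, 11, 11],
})

# A column with a single distinct value carries no information for clustering.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)
```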

Duplicates

There are 358 duplicated rows but only 358 - 182 = 176 unique occurrences, so some rows even have 3 or more copies. I guess they are not valid, because I don't think it is possible to have absolutely identical customers: the same amount spent on each type of product, the same number of purchases, the same number of website visits in the last month, and the same values in all other features. Without the dataset author I can't be sure whether they are valid or not, but I will drop them
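One plausible way to get counts like these with pandas is sketched below. It assumes there is an ID column that has to be excluded before comparing rows; the toy values are made up.

```python
import pandas as pd

# Toy data: one row appears 3 times, one appears twice, one is unique.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Income": [100, 100, 100, 200, 200, 300],
    "Kidhome": [1, 1, 1, 0, 0, 2],
})

# Compare rows on everything except the identifier (assumed column name).
features = df.drop(columns="ID")

n_in_groups = int(features.duplicated(keep=False).sum())   # every row that has a twin
n_extra = int(features.duplicated(keep="first").sum())     # copies beyond the first
n_unique_patterns = n_in_groups - n_extra                  # distinct duplicated rows

# Keep the first occurrence of each duplicated row.
deduped = df.loc[~features.duplicated(keep="first")].reset_index(drop=True)
```

If `n_in_groups` is more than twice `n_unique_patterns`, at least one row must occur 3 or more times, which is the situation described above.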

I have no idea in which year this dataset was collected, because the dataset creator doesn't provide this information

So let's assume that the dataset was collected on the day after the last customer enrollment plus 2 years, because most of the features are aggregated over the last 2 years.

There are two time-related features: Year_Birth and Dt_Customer. I will transform Year_Birth into an Age feature by subtracting the year of birth from 2016. It is also convenient to transform the datetime feature Dt_Customer into an integer feature CustomerFor, which is the number of days since customer enrollment
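A sketch of these two transformations under the stated assumption about the collection date. The day-month-year date format and the toy rows are assumptions for illustration, not taken from the dataset description.

```python
import pandas as pd

df = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1965],
    "Dt_Customer": ["04-09-2012", "10-02-2014", "29-06-2014"],
})

# Assumed format: day-month-year.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# Assumed collection date: last enrollment + 2 years + 1 day.
collection_date = (
    df["Dt_Customer"].max() + pd.DateOffset(years=2) + pd.Timedelta(days=1)
)

df["Age"] = collection_date.year - df["Year_Birth"]
df["CustomerFor"] = (collection_date - df["Dt_Customer"]).dt.days
```

With these toy rows the assumed collection date falls in 2016, so Age is simply 2016 minus the year of birth.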

Age distribution

There are really old customers ._.

I think these are typos made while entering the year of birth. Everything else in these rows looks okay, so I will leave them out when training the model, but keep them in the dataset used for prediction

Income

There are some outliers. Let's look at them.

There is no reason to consider this data invalid. But outliers can negatively affect clustering methods, so I'll also drop these rows when training

Filling nulls

Let's impute these NaNs with the median
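A minimal sketch of median imputation for Income, on made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 70000.0, np.nan]})

# Series.median() skips NaNs by default, so it is computed on observed values only.
income_median = df["Income"].median()
df["Income"] = df["Income"].fillna(income_median)
```

The median is less sensitive to the income outliers mentioned above than the mean would be.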

Education

Graduation and 2n Cycle are not clear values.

All countries conveyed their national systems to a two cycle structure consisting of a first (undergraduate) and a second (graduate) cycle. Source: EHEA

According to the three-cycle system of the European Higher Education Area, 2n Cycle refers to a Master's degree. And Graduation means that the person is in the second (graduate) cycle, so in fact they have finished the first (undergraduate) cycle, which in many countries is called a Bachelor's degree

So the changes are as follows: Graduation becomes Bachelor and 2n Cycle becomes Master
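The renaming implied by the reasoning above could be sketched like this (Graduation to Bachelor, 2n Cycle to Master; other values are left untouched):

```python
import pandas as pd

df = pd.DataFrame(
    {"Education": ["Graduation", "PhD", "2n Cycle", "Master", "Basic"]}
)

# Values not listed in the mapping are kept as they are.
education_map = {"Graduation": "Bachelor", "2n Cycle": "Master"}
df["Education"] = df["Education"].replace(education_map)
```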

Let's see how the Income varies across different education degrees

Income for the Bachelor, PhD and Master degrees is about the same, but Income for the Basic degree is definitely lower than for the others

Marital status

We can merge Alone into the Single category, but YOLO and Absurd are not clear.

YOLO

YOLO (You only live once) refers to a lifestyle or trend that many young people have adopted as a way to better enjoy life without thinking about saving up for the future.

I assume that the YOLO category refers to people who do not have a permanent partner, so I will merge it into Single

Absurd

In philosophy, "the Absurd" refers to the conflict between the human tendency to seek inherent value and meaning in life, and the human inability to find these with any certainty.

I would also merge Absurd into Single

Now let's look at the proportions of marital statuses

I have an idea to combine the statuses [Single, Widow, Divorced] and [Together, Married], because the client, as a consumer, is better described not by a specific status but by the presence of a partner
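A sketch of both merges on toy data. The column name Marital_Status and the new feature name HasPartner are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Marital_Status": [
        "Married", "Together", "Single", "Divorced",
        "Widow", "Alone", "YOLO", "Absurd",
    ]
})

# Step 1: fold the unclear categories into Single.
to_single = {"Alone": "Single", "YOLO": "Single", "Absurd": "Single"}
status = df["Marital_Status"].replace(to_single)

# Step 2: reduce the status to presence of a partner (hypothetical feature name).
df["HasPartner"] = status.isin(["Married", "Together"]).astype(int)
```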

So most customers are in a relationship

Kidhome and Teenhome

Let's count all children in the household with a NumChildren feature

I would also introduce a HasChildren feature, which equals 1 if the customer has one or more children and 0 if the customer has no children
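These two features can be sketched as follows (toy values):

```python
import pandas as pd

df = pd.DataFrame({"Kidhome": [0, 1, 2, 0], "Teenhome": [0, 0, 1, 1]})

# Total number of children in the household.
df["NumChildren"] = df["Kidhome"] + df["Teenhome"]

# Binary flag: 1 if there is at least one child, else 0.
df["HasChildren"] = (df["NumChildren"] > 0).astype(int)
```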

We see that most customers have 1 child

Customers without children have a higher income

Amount spent

Wine and Meat products account for the most spending

Let's introduce the MntTotal feature, which is the total amount spent by a customer in the last 2 years

I will also calculate, for each customer, the percentage of the total amount spent that went to each product type
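A sketch of the total and the per-product percentages. The spending column names (MntWines, MntMeatProducts, MntGoldProds) and the Pct prefix for the new columns are assumptions for this example, and only three product types are shown to keep it short.

```python
import pandas as pd

mnt_cols = ["MntWines", "MntMeatProducts", "MntGoldProds"]
df = pd.DataFrame(
    [[60, 30, 10], [10, 30, 10]],
    columns=mnt_cols,
)

# Total amount spent across all product types.
df["MntTotal"] = df[mnt_cols].sum(axis=1)

# Row-wise share of each product type, in percent of the customer's total.
shares = df[mnt_cols].div(df["MntTotal"], axis=0) * 100
shares.columns = [col.replace("Mnt", "Pct", 1) for col in mnt_cols]
df = df.join(shares)
```

A customer with MntTotal equal to 0 would get NaN shares here, so such rows would need separate handling.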

Let's analyze these features in terms of education

Customers with PhD, Bachelor and Master degrees mostly spend on Wine and Meat products. It is also worth noting that more than half of PhD holders' spending goes to Wine products (in median)

And Basic degree customers spend more on Gold, Fish and Sweets

Now let's check how the total amount spent depends on education degree and number of children

Customers with a PhD degree spent the most in the last 2 years, and customers with a Basic degree spent the least. That corresponds to the Income distribution

And the amount spent by parents and non-parents differs a lot

Number of purchases

Most purchases are made in the store

An interesting insight is that customers with a Basic education degree have more website visits per month than others:

Despite that, they do not buy more on the website. They might just be monitoring the deals. So the store could publish more deals on the site to encourage them to buy more

The same is true of customers who have children:

The feature NumTotalPurchases is the sum of all purchases made by a customer

There are 6 customers with 0 purchases whose total amount spent is not 0. This looks like incorrectly collected data, so let's delete these rows.

Let's look at the correlation between CustomerFor and NumTotalPurchases

There seems to be no correlation between CustomerFor and NumTotalPurchases. So I assume that the number of purchases was also collected over the last 2 years, and we can evaluate a customer's activity with NumTotalPurchases

AvgCheck is the average check of a customer's purchases
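A sketch of the purchase features described above. Which purchase columns go into the sum is an assumption (web, catalog and store are used here), and AvgCheck is assumed to be MntTotal divided by NumTotalPurchases.

```python
import pandas as pd

purchase_cols = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]
df = pd.DataFrame({
    "MntTotal": [100, 40, 6],
    "NumWebPurchases": [2, 0, 0],
    "NumCatalogPurchases": [1, 0, 0],
    "NumStorePurchases": [2, 4, 0],
})

df["NumTotalPurchases"] = df[purchase_cols].sum(axis=1)

# Rows with money spent but zero purchases are inconsistent, so drop them.
inconsistent = (df["NumTotalPurchases"] == 0) & (df["MntTotal"] > 0)
df = df.loc[~inconsistent].reset_index(drop=True)

# Average amount per purchase (safe now: no zero denominators remain in this toy data).
df["AvgCheck"] = df["MntTotal"] / df["NumTotalPurchases"]
```

Dropping the inconsistent rows first also avoids dividing by zero when computing the average check.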

There is also one outlier with an average check of 1679

Accepted campaigns

Let's add an AcceptedTotal feature, which is the number of campaigns accepted by a customer
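A sketch of this feature, assuming the campaign flags are stored in binary columns named AcceptedCmp1 to AcceptedCmp5. Whether the response to the last campaign should also be counted is not stated here, so it is left out.

```python
import pandas as pd

campaign_cols = [f"AcceptedCmp{i}" for i in range(1, 6)]  # assumed column names
df = pd.DataFrame(
    [[0, 0, 0, 0, 0], [1, 0, 1, 0, 0], [1, 1, 1, 1, 0]],
    columns=campaign_cols,
)

# Each flag is 0 or 1, so the row sum is the number of accepted campaigns.
df["AcceptedTotal"] = df[campaign_cols].sum(axis=1)

# Share of customers who accepted no campaign at all.
share_none = (df["AcceptedTotal"] == 0).mean()
```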

The majority of customers didn't accept any campaign

There are no Basic degree customers who accepted more than 2 campaigns

This graph shows that the more campaigns a customer accepted, the more likely the customer is to have no children

Data Cleaning

Multivariate analysis

We can see distinct groups of points on some scatterplots: MntTotal vs NumTotalPurchases, for instance, or AvgCheck vs Age

That happens because NumTotalPurchases has a multimodal distribution, and AvgCheck is calculated from NumTotalPurchases

All correlations are clear and explainable

Data preprocessing

Feature scaling

The features and the scaling method were chosen by an iterative process of evaluating different combinations with the silhouette score

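By the way, the Box-Cox transformation is:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$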

PowerTransformer automatically selects lambda by maximum likelihood estimation.
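A minimal sketch of a Box-Cox transform with scikit-learn's PowerTransformer on synthetic, strictly positive data. The data and parameter choices are illustrative, not the features used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Skewed, strictly positive toy features (Box-Cox is undefined for values <= 0).
X = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 2))

# method="box-cox" must be requested explicitly; the default is "yeo-johnson".
pt = PowerTransformer(method="box-cox", standardize=True)
X_t = pt.fit_transform(X)

# One fitted lambda per feature, estimated by maximum likelihood.
print(pt.lambdas_)
```

Features that contain zeros would need a shift or the Yeo-Johnson method instead, since Box-Cox only accepts strictly positive input.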

The selected features are correlated, but I guess this is not really a problem for k-means clustering (source). In addition, the other, uncorrelated features are not really interesting to cluster by

Clustering

K Means

Choosing the number of clusters

I will use the elbow rule and silhouette score visualisation to choose the optimal number of clusters (k)

I think 4 is the optimal number of clusters.

According to the elbow rule plot, 4 or 5 clusters could be optimal

Looking at the silhouette coefficient visualisation, 4 clusters provide relatively high silhouette scores for each cluster.
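A sketch of how the two criteria can be computed for a range of k. It runs on synthetic blobs rather than the customer features, so the numbers it produces say nothing about this dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # elbow rule: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # mean silhouette, higher is better

best_k = max(silhouettes, key=silhouettes.get)
print(inertias)
print(silhouettes)
```

The mean silhouette is only a summary; the per-cluster silhouette plot referred to above also shows whether individual clusters are weak.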

PCA visualisation

The model confuses clusters 2 and 3 a little, but that is not a big deal in general

Explained variance of the PCA eigenvectors:

One eigenvector explains around 90% of the variance, which happens because the variables are correlated
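To illustrate why correlated variables concentrate variance in one component, here is a sketch on synthetic data where three features share one underlying signal. The data is made up and is not the notebook's feature set.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=500)

# Three strongly correlated features: the same signal plus small independent noise.
X = np.column_stack(
    [signal + 0.1 * rng.normal(size=500) for _ in range(3)]
)

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios)
```

In this toy setup almost all variance lies along the shared direction, so the first ratio dominates.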

And now let's predict cluster labels on the full dataset, including outliers
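A sketch of fitting on the outlier-free subset and then labelling every row. The data is synthetic, the percentile-based outlier rule is a hypothetical stand-in for the filters described earlier, and the pipeline layout is an assumption rather than the exact setup of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_all = rng.lognormal(mean=3.0, sigma=0.8, size=(300, 2))  # strictly positive toy data

# Hypothetical outlier rule: train only on rows below the 99th percentile in every column.
cap = np.quantile(X_all, 0.99, axis=0)
train_mask = (X_all < cap).all(axis=1)

pipe = make_pipeline(
    PowerTransformer(method="box-cox"),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
pipe.fit(X_all[train_mask])

# predict() reuses the fitted transform, then assigns each row to its nearest centroid.
# scikit-learn labels start at 0; adding 1 gives clusters numbered 1 to 4.
labels_all = pipe.predict(X_all) + 1
```

Keeping the transformer and the model in one pipeline ensures the outlier rows are scaled with the parameters learned from the training subset.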

Clusters analysis

Cluster 1 is the biggest cluster, with around 1/3 of all customers. Clusters 2, 3 and 4 are about the same size

Income

Cluster 1: low income

Cluster 2: high income

Cluster 3: very high income

Cluster 4: medium income

And the income outliers fall into the 3rd cluster

Average check

Average check corresponds to the income of the clusters, but the gap between the 3rd cluster and the others is huge

Cluster 2 and 3 customers are the most active and frequent buyers, cluster 4 has medium frequency, and cluster 1 has a low frequency of purchases

Education

The Basic degree is found mostly in the 1st cluster

Children

Clusters 1, 2 and 4 consist mostly of parents. And customers in the 3rd cluster are mostly single

Customers from the 2nd and 3rd clusters buy from the catalog more than those from clusters 1 and 4. Maybe catalog products are new products and they are pretty expensive

Now let's look at the number of website visits by cluster

The 1st and 4th clusters visit the website the most

As we see, the popular product types are the same in all clusters: Wine and Meat. But cluster 3 buys Meat more than the others. Cluster 1 buys Gold products in addition to Wine and Meat

Accepted Campaigns

We see that:

Complaints

Cluster 1 purchases less but complains more, that's interesting

Results

I would rate the clusters as follows:

Platinum customers:

Gold customers:

Silver customers:

Bronze customers:

Acquired skills: